Crawling Strategy: How To Ensure High Success Rates

Like all search engine spiders, the crawler's success rate is not always 100% - in other words, when crawling web-sites it is not always possible for the spider to find every page automatically. This section includes tips for improving the success rate. It is also worth pointing out that pages can always be added manually through the 'Edit Document List' button in the Index Management tool, or programmatically through the DocumentIndex class.

#1. Think of the spider as a Javascript-disabled browser - the spider 'sees' your web-site the way a browser does, and it cannot see pages that are not linked to. It has no access to your directory structure.

#2. Most links written by Javascript are unreadable: because the spider is not a Javascript engine, it doesn't know that your popup menu exists. See #3.

#3. Provide commented-out links, or a 'site map', to help the spider. E.g. if your web-site does have a popup menu that uses Javascript, add hidden links like

<!--
href="page1.aspx"
href="page2.aspx"
-->

this won't affect users, but will match one of the spider's regex patterns. These hidden links can be generated by your code at runtime and added to the page, as in the sketch below.
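For example, if the page list comes from the same data source as the Javascript menu, the comment block can be written into the page at runtime. The following is only a sketch for an ASP.NET Web Forms code-behind; the page class, the 'menuPages' list and the 'hiddenLinks' Literal control are hypothetical names, not part of the product.

using System;
using System.Text;

public partial class MenuPage : System.Web.UI.Page
{
    // Assumes the .aspx markup declares: <asp:Literal ID="hiddenLinks" runat="server" />
    protected void Page_Load(object sender, EventArgs e)
    {
        // In practice this list would come from the same source that drives the popup menu.
        string[] menuPages = { "page1.aspx", "page2.aspx", "page3.aspx" };

        // Build commented-out links: browsers ignore them, but the spider's
        // regex patterns will still find the href values.
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("<!--");
        foreach (string page in menuPages)
            sb.AppendLine("href=\"" + page + "\"");
        sb.Append("-->");

        hiddenLinks.Text = sb.ToString();
    }
}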

#4. If the spider reports a problem such as 404 Not Found or 401 Unauthorized, check the URL in a browser (for 404s), and consult the Forms Authentication section (for 401s).

#5. If need be, bypass the spider and add documents directly using the DocumentIndex class's AddDocument method, or manually with the Index Management tool.
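For reference, a programmatic call might look like the sketch below; the DocumentIndex constructor arguments, the AddDocument parameter type and any save/close step are assumptions, so consult the DocumentIndex class reference for the exact signatures.

// Sketch only - the signatures shown are assumptions, not the documented API.
void AddPageToIndex(string url)
{
    DocumentIndex index = new DocumentIndex();   // may require configuration or an index directory path
    index.AddDocument(url);                      // assumed overload taking a URL string
    index.Close();                               // assumed: persist the updated index
}

// E.g. for a page the spider could not reach:
// AddPageToIndex("http://www.yoursite.com/unlinked-page.aspx");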